gender and age groups. The data is contained in three files:
Import all necessary libraries
import pandas as pd
import json
import numpy as np
import datetime
from tqdm import tqdm_notebook as tqdm
import pickle
import time
# importing sys
import sys
# adding scripts to the system path
sys.path.insert(0, '/Users/fidahussainrao/Downloads/Predict customer response/scripts')
from eda import *
import plotly.io as pio
pio.renderers.default='notebook'
import warnings
warnings.filterwarnings('ignore')
def load_data(file):
    # line-delimited JSON requires orient='records'
    return pd.read_json(f'../data/{file}.json', orient='records', lines=True)
portfolio = load_data('portfolio')
profile = load_data('profile')
transcript = load_data('transcript')
explore_df(portfolio)
Shape of the dataframe is: (10, 6)
There are no missing values in this dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 reward 10 non-null int64
1 channels 10 non-null object
2 difficulty 10 non-null int64
3 duration 10 non-null int64
4 offer_type 10 non-null object
5 id 10 non-null object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
None
reward difficulty duration
count 10.000000 10.000000 10.000000
mean 4.200000 7.700000 6.500000
std 3.583915 5.831905 2.321398
min 0.000000 0.000000 3.000000
25% 2.000000 5.000000 5.000000
50% 4.000000 8.500000 7.000000
75% 5.000000 10.000000 7.000000
max 10.000000 20.000000 10.000000
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
# unique offer types
portfolio['offer_type'].unique()
array(['bogo', 'informational', 'discount'], dtype=object)
cols = ['reward', 'difficulty', 'duration']
plot_hist(portfolio, cols)
The above exploration shows that:
- The portfolio contains 10 offers and has no missing values.
- There are three offer types: bogo, informational, and discount.
- The reward ranges from 0 to 10, the difficulty from 0 to 20, and the duration from 3 to 10 days.
explore_df(profile)
Shape of the dataframe is: (17000, 5)
Missing values
gender 2175
age 0
id 0
became_member_on 0
income 2175
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 14825 non-null object
1 age 17000 non-null int64
2 id 17000 non-null object
3 became_member_on 17000 non-null int64
4 income 14825 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
None
age became_member_on income
count 17000.000000 1.700000e+04 14825.000000
mean 62.531412 2.016703e+07 65404.991568
std 26.738580 1.167750e+04 21598.299410
min 18.000000 2.013073e+07 30000.000000
25% 45.000000 2.016053e+07 49000.000000
50% 58.000000 2.017080e+07 64000.000000
75% 73.000000 2.017123e+07 80000.000000
max 118.000000 2.018073e+07 120000.000000
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
profile['gender'].unique()
array([None, 'F', 'M', 'O'], dtype=object)
cols = ['age', 'income']
plot_hist(profile, cols, bins=50)
The exploration above shows that:
- The profile dataset contains 17,000 customers; gender and income are each missing for 2,175 rows.
- The rows with missing gender and income all have age 118, which looks like a placeholder for unknown customers.
- became_member_on is stored as an integer in YYYYMMDD format and needs to be parsed into a date.
explore_df(transcript)
Shape of the dataframe is: (306534, 4)
There are no missing values in this dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 person 306534 non-null object
1 event 306534 non-null object
2 value 306534 non-null object
3 time 306534 non-null int64
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
None
time
count 306534.000000
mean 366.382940
std 200.326314
min 0.000000
25% 186.000000
50% 408.000000
75% 528.000000
max 714.000000
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
## there is a common key in the value column for every event type.
for event in transcript['event'].unique():
    key_ = []
    for val in transcript[transcript['event'] == event]['value']:
        key_.append(val.keys())
    print(f'The common key(s) in the values column that corresponds to {event.capitalize()} in the event column is/are:')
    for j, k in enumerate(list(np.unique(key_)[0]), start=1):
        print(f'{j} - {k.capitalize()}')
    print()
The common key(s) in the values column that corresponds to Offer received in the event column is/are:
1 - Offer id

The common key(s) in the values column that corresponds to Offer viewed in the event column is/are:
1 - Offer id

The common key(s) in the values column that corresponds to Transaction in the event column is/are:
1 - Amount

The common key(s) in the values column that corresponds to Offer completed in the event column is/are:
1 - Offer_id
2 - Reward
list(np.unique(key_)[0])
['offer_id', 'reward']
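The per-event key inspection above can also be written as a single groupby. A minimal sketch on toy data (the `toy` frame and its values are hypothetical stand-ins for the real transcript):

```python
import pandas as pd

# Hypothetical stand-in for the transcript frame: each row's 'value'
# cell is a dict whose keys depend on the event type.
toy = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer completed'],
    'value': [{'offer id': 'abc'},
              {'amount': 19.89},
              {'offer_id': 'abc', 'reward': 5}],
})

# For each event type, collect the sorted set of keys seen in 'value'.
keys_per_event = (toy.groupby('event')['value']
                     .apply(lambda s: sorted({k for d in s for k in d})))
print(keys_per_event['offer completed'])  # ['offer_id', 'reward']
```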
plt.figure(figsize=(12,5))
plt.hist(transcript['time'], bins=50, color='lightblue')
plt.grid(axis='y', alpha=0.75)
plt.title('TIME');
The exploration above shows that:
- The transcript contains 306,534 events and no missing values.
- The time column is recorded in hours and runs from 0 to 714 (roughly 30 days).
- The value column holds a dictionary whose keys depend on the event type: an offer id for received/viewed events, an amount for transactions, and an offer_id plus reward for completions.
## cleaning the channels column, creating dummy variables.
def one_hot_encode(item_lists):
    # get all the unique channels
    unique_items = []
    for item_list in item_lists:
        for item in item_list:
            unique_items.append(item)
    unique_items = list(set(unique_items))
    # create an empty dict
    item_dict = {}
    # loop through all the channels
    for item in unique_items:
        # apply a boolean mask
        item_dict[item] = item_lists.apply(lambda x: item in x)
    # return the results as a dataframe
    return pd.DataFrame(item_dict)
portfolio = portfolio[['id', 'offer_type', 'duration', 'difficulty', 'reward', 'channels']]
# one hot encode the channels column, and concatenate it with the remaining dataframe.
portfolio = pd.concat([portfolio, one_hot_encode(portfolio['channels'])],axis=1)
portfolio.iloc[:,-4:] = portfolio.iloc[:,-4:].astype('int')
portfolio = portfolio.drop('channels',axis=1)
portfolio.head()
| | id | offer_type | duration | difficulty | reward | email | mobile | web | social |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ae264e3637204a6fb9bb56bc8210ddfd | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 |
| 1 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | bogo | 5 | 10 | 10 | 1 | 1 | 1 | 1 |
| 2 | 3f207df678b143eea3cee63160fa8bed | informational | 4 | 0 | 0 | 1 | 1 | 1 | 0 |
| 3 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | bogo | 7 | 5 | 5 | 1 | 1 | 1 | 0 |
| 4 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | discount | 10 | 20 | 5 | 0 | 1 | 1 | 0 |
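For reference, pandas can produce the same channel dummies without a hand-rolled loop, via `str.join` and `str.get_dummies`. A sketch on a hypothetical two-row channels column (note that `get_dummies` sorts its columns alphabetically, unlike the set-ordered loop above):

```python
import pandas as pd

# Hypothetical stand-in for portfolio['channels'].
channels = pd.Series([['email', 'mobile', 'social'], ['web', 'email', 'mobile']])

# Join each list with '|' and let str.get_dummies split it back into 0/1 columns.
dummies = channels.str.join('|').str.get_dummies()
print(dummies.columns.tolist())  # ['email', 'mobile', 'social', 'web']
```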
portfolio = portfolio.reset_index()
portfolio = portfolio.rename(columns={'index': 'offer_num'})
portfolio = portfolio.rename(columns={'id': 'offer_id'})
profile[profile['age'] == profile['age'].max()].isna().sum()
gender              2175
age                    0
id                     0
became_member_on       0
income              2175
dtype: int64
profile = profile[profile['age'] != profile['age'].max()]
profile.isna().sum()
gender              0
age                 0
id                  0
became_member_on    0
income              0
dtype: int64
## cleaning the date column
profile['became_member_on'] = profile['became_member_on'].apply(lambda x:datetime.datetime.strptime(str(x), '%Y%m%d'))
profile['membership_month'] = profile['became_member_on'].dt.month
profile['membership_year'] = profile['became_member_on'].dt.year
profile['membership_day'] = profile['became_member_on'].dt.day
profile['member_since'] = (datetime.datetime.now() - profile['became_member_on']).dt.days
profile = profile.drop('became_member_on', axis=1)
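The row-wise `strptime` apply above works, but pandas can parse the whole integer column at once, which is much faster on 17,000 rows. A sketch under the same YYYYMMDD assumption, on a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical sample of the integer YYYYMMDD membership column.
toy = pd.DataFrame({'became_member_on': [20170212, 20180712]})

# Vectorized parse: cast to string once, then parse the whole column
# with an explicit format instead of calling strptime row by row.
dates = pd.to_datetime(toy['became_member_on'].astype(str), format='%Y%m%d')
print(dates.dt.year.tolist())  # [2017, 2018]
```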
profile = profile.rename(columns={'id': 'person_id'})
transcript = transcript.rename(columns={'person': 'person_id'})
transcript = pd.merge(transcript, profile['person_id'], on='person_id', how='inner')
# convert time from hours to days
transcript['time'] = transcript['time'] / 24
transcript['event'].unique()
array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
dtype=object)
offers = pd.DataFrame.from_records(transcript['value'])
offers.head()
| | offer id | amount | offer_id | reward |
|---|---|---|---|---|
| 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN | NaN | NaN |
| 1 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN | NaN | NaN |
| 2 | NaN | 19.89 | NaN | NaN |
| 3 | NaN | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5.0 |
| 4 | NaN | 17.78 | NaN | NaN |
offers['offer_id'] = np.where(offers['offer id'].isnull() & offers['offer_id'].notnull(),
offers['offer_id'], offers['offer id'])
offers = offers.drop('offer id', axis=1)
offers.head(3)
| | amount | offer_id | reward |
|---|---|---|---|
| 0 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 1 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 2 | 19.89 | NaN | NaN |
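The `np.where` coalesce above can also be expressed with `Series.combine_first`, which fills nulls in one column from another. A sketch on hypothetical toy data mirroring the two id columns:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with the two competing id columns.
toy = pd.DataFrame({'offer id': ['abc', np.nan, np.nan],
                    'offer_id': [np.nan, np.nan, 'xyz']})

# Prefer 'offer id'; where it is null, fall back to 'offer_id'.
toy['offer_id'] = toy['offer id'].combine_first(toy['offer_id'])
toy = toy.drop(columns='offer id')
print(toy['offer_id'].iloc[0])  # abc
```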
transcript = pd.concat([transcript.drop('value',axis=1),offers],axis=1)
transcript.head()
| | person_id | event | time | amount | offer_id | reward |
|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0.00 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 1 | 78afa995795e4d85b5d9ceeca43f5fef | offer viewed | 0.25 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 2 | 78afa995795e4d85b5d9ceeca43f5fef | transaction | 5.50 | 19.89 | NaN | NaN |
| 3 | 78afa995795e4d85b5d9ceeca43f5fef | offer completed | 5.50 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5.0 |
| 4 | 78afa995795e4d85b5d9ceeca43f5fef | transaction | 6.00 | 17.78 | NaN | NaN |
profile['age'].describe()
count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000
Name: age, dtype: float64
An average Starbucks customer is 54.4 years old.
plt.figure(figsize=(15,6))
profile['age'].plot.hist(grid=True, bins=profile['age'].nunique(), rwidth=0.9,
color='lightslategray')
plt.title('What is the average age of the customer?')
plt.xlabel('Age')
plt.ylabel('Count')
plt.vlines(x = profile['age'].mean(), ymin=0, ymax=450, linestyles='--', linewidth=3, color='k')
plt.text(profile['age'].mean()+.5, 440, 'Mean', ha ='left', va ='center')
plt.grid(axis='y', alpha=0.75)
# gaussian shape centered on the mean value
# getting the normalized gender distribution
gender_dist = 100 * profile['gender'].value_counts(normalize=True).sort_values()
gender_dist
O     1.430017
F    41.342327
M    57.227656
Name: gender, dtype: float64
x = gender_dist.index
y = gender_dist
plot_bar_chart(x, y, title='Normalized gender distribution of Starbucks customers (%)')
Next, analyze the total number of instances of each event: offer received, offer viewed, offer completed, and transaction.
event_dist = transcript['event'].value_counts().sort_values()
x = event_dist.index
y = event_dist
plot_bar_chart(x, y, title='Event Distribution')
An offer's success rate can be determined by dividing the total number of offer completed events by the number of offer received events. The results are plotted on a bar chart.
# getting the total instances of each event.
# To get the total number of events for each offer id, we will group by both event and offer id.
offersbyEvent = transcript.groupby(['event', 'offer_id']).count().iloc[:,0].reset_index()
# use pivot function to obtain the table in the required form.
# drop nan values if any.
offersbyEvent = pd.pivot(data=offersbyEvent, index='offer_id',
columns='event', values='person_id' )[['offer completed', 'offer received']].dropna()
# dividing offer completed by offer received for each offer.
offersbyEvent = offersbyEvent['offer completed'] / offersbyEvent['offer received']
# create a dictionary to map the offer id to their respective offer number.
id_num = dict(zip(portfolio.offer_id, portfolio.offer_num))
offersbyEvent.index = offersbyEvent.index.map(id_num)
offersbyEvent = offersbyEvent.sort_values(ascending=False)
x = offersbyEvent.index.astype(str)
y = offersbyEvent * 100
title='Which are the most Successful offers?'
plot_bar_chart(x, y, title)
The bar chart above shows that the most successful offer in the portfolio is offer number 6, followed by offers 5, 8, and 3.
In this section, I will calculate the success rate of each offer by age group. The same method as in the previous section can be used to calculate the success rate. Finally, use a grouped bar chart to visualise the results. I'll create a helper function to create a grouped bar chart in case it's required later.
profile['age'].describe()
count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000
Name: age, dtype: float64
Computing success rates for every individual age would be computationally expensive and hard to interpret. Instead, we will bin customers' ages into decades.
profile['age_group'] = pd.cut(profile['age'], bins=range(10,111,10) )
profile['age_group'].unique()
[(50, 60], (70, 80], (60, 70], (20, 30], (40, 50], (30, 40], (90, 100], (10, 20], (80, 90], (100, 110]]
Categories (10, interval[int64, right]): [(10, 20] < (20, 30] < (30, 40] < (40, 50] ... (70, 80] < (80, 90] < (90, 100] < (100, 110]]
# create a dictionary to map person_id to their respective age groups
id_agegroup = dict(zip(profile.person_id, profile.age_group))
transcript['age'] = transcript['person_id'].map(id_agegroup)
print(transcript['age'].value_counts())
transcript.head()
(50, 60]      64707
(60, 70]      52308
(40, 50]      43611
(30, 40]      31532
(70, 80]      29216
(20, 30]      27037
(80, 90]      13855
(10, 20]       6544
(90, 100]      3854
(100, 110]       98
Name: age, dtype: int64
| | person_id | event | time | amount | offer_id | reward | age |
|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0.00 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN | (70, 80] |
| 1 | 78afa995795e4d85b5d9ceeca43f5fef | offer viewed | 0.25 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN | (70, 80] |
| 2 | 78afa995795e4d85b5d9ceeca43f5fef | transaction | 5.50 | 19.89 | NaN | NaN | (70, 80] |
| 3 | 78afa995795e4d85b5d9ceeca43f5fef | offer completed | 5.50 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5.0 | (70, 80] |
| 4 | 78afa995795e4d85b5d9ceeca43f5fef | transaction | 6.00 | 17.78 | NaN | NaN | (70, 80] |
# getting the total count of each event grouped by the offer id and the age groups.
offersbyAge = transcript.groupby(['age', 'event', 'offer_id']).count().iloc[:,0].reset_index()
# getting the table in the required format.
offersbyAge = pd.pivot(data=offersbyAge, index=['age','offer_id'], columns='event', values='person_id' )
# divide the offer completed by the offer received to obtain the success rate of each offer.
offersbyAge = (offersbyAge['offer completed'] / offersbyAge['offer received']).dropna().reset_index()
offersbyAge = offersbyAge.rename(columns={0:'success_rate'})
offersbyAge['offer_num'] = offersbyAge['offer_id'].map(id_num)
offersbyAge = offersbyAge.sort_values(by=['age', 'offer_num'])
title = 'Which are the most successful offers for each age group?'
plot_groupedBarChart(offersbyAge, 'age', title)
The graph above shows the most successful offers for each age group.
Now, I will calculate the success rate of each offer with respect to gender. The profile dataset records three gender categories: Male (M), Female (F), and Other (O). Lastly, visualize the results using the helper function created in the previous section.
# create a dictionary to map person_id to their respective genders.
id_gender = dict(zip(profile.person_id, profile.gender))
transcript['gender'] = transcript['person_id'].map(id_gender)
transcript['gender'].value_counts()
M    155690
F    113101
O      3971
Name: gender, dtype: int64
We can easily calculate the success rate for each gender using the same method we used to calculate it for different age groups.
offersbyGender = transcript.groupby(['gender', 'event', 'offer_id']).count().iloc[:,0].reset_index()
offersbyGender = pd.pivot(data=offersbyGender, index=['gender','offer_id'], columns='event', values='person_id' )
offersbyGender = (offersbyGender['offer completed'] / offersbyGender['offer received']).dropna().reset_index()
offersbyGender = offersbyGender.rename(columns={0:'success_rate'})
offersbyGender['offer_num'] = offersbyGender['offer_id'].map(id_num)
offersbyGender = offersbyGender.sort_values(by=['gender', 'offer_num'])
title = 'Which are the most successful offers for each gender?'
plot_groupedBarChart(offersbyGender, 'gender', title)
The graph above shows the most successful offers for each gender.
plot_hist(profile, ['member_since', 'membership_year', 'membership_month'])
In order to determine what a successful offer is, I will further explore all three datasets and see how they interact with each other. First, let's explore which events occur for each offer type.
a = pd.merge(transcript[['offer_id','event','time']],
portfolio[['offer_id', 'offer_type']],
on='offer_id', how='left')
a.groupby(['offer_type', 'event']).count().iloc[:,1]
offer_type event
bogo offer completed 15258
offer received 26537
offer viewed 22039
discount offer completed 17186
offer received 26664
offer viewed 18461
informational offer received 13300
offer viewed 9360
Name: time, dtype: int64
There are four types of events in our data: offer received, offer viewed, transaction, and offer completed. However, as shown above, transaction events are not associated with any offer type: in the transcript dataframe, the offer ID (and hence the offer type) is null for every transaction. As a result, we need to figure out how to populate the offer IDs for transactions.
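One way to attach offers to transactions is a per-customer forward fill, which propagates the last seen offer_id onto the events that follow it. This is a sketch under the assumption that each customer's events are sorted by time; the `toy` frame is hypothetical. Note that unlike a single-step shift, `ffill` propagates past any number of intervening rows, so it is a looser heuristic:

```python
import pandas as pd
import numpy as np

# Hypothetical, time-sorted events for two customers.
toy = pd.DataFrame({
    'person_id': ['p1', 'p1', 'p1', 'p2'],
    'event': ['offer received', 'transaction', 'offer completed', 'transaction'],
    'offer_id': ['abc', np.nan, 'abc', np.nan],
})

# Forward-fill offer_id within each customer's events.
toy['offer_id'] = toy.groupby('person_id')['offer_id'].ffill()
# p2's transaction had no prior offer, so it stays NaN and is dropped.
toy = toy.dropna(subset=['offer_id'])
print(toy['offer_id'].tolist())  # ['abc', 'abc', 'abc']
```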
Sorting the transcript dataset by person and time to ensure that each event for each person occurs in sequence.
transcript_processed = transcript.sort_values(by=['person_id', 'time'])
transcript_processed = transcript_processed.reset_index(drop=True)
def clean_transactions(data):
    """
    In the transcript table, offer IDs are null when the corresponding event
    is a transaction. This function fills in the missing offer IDs.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              for a customer.
    OUTPUT:
        data: cleaned dataframe, with no null offer IDs.
    """
    data['prev_person_id'] = data['person_id'].shift()
    data['offer_id'] = np.where((data['offer_id'].isnull())
                                & (data['prev_person_id'] == data['person_id']),
                                data['offer_id'].shift(),
                                data['offer_id'])
    # If a customer made a transaction without even receiving an offer, the
    # offer_id will still be NaN. We drop these records, since there is no way
    # of filling in these offer IDs; they are also unsuccessful offers, because
    # the customer transacted without any knowledge of the offer.
    data = data.dropna(subset=['offer_id'])
    data = data.drop('prev_person_id', axis=1)
    return data
transcript_processed = clean_transactions(transcript_processed)
transcript_processed.isna().sum()
person_id         0
event             0
time              0
amount       148805
offer_id          0
reward       188174
age               0
gender            0
dtype: int64
Moreover, note that BOGO and discount offers generate an offer completed event once the offer is completed, whereas informational offers never produce this event. We can now define a successful offer as follows:

Group 1
For the BOGO and discount offer types, the events have to follow this sequence for the offer to be considered successful:
offer received -> offer viewed -> transaction -> offer completed
For informational offers, which have no completion event, the sequence is:
offer received -> offer viewed -> transaction (completed within the given time duration)

In the following cases the offer will be considered ineffective:

Group 2
Offer was received and viewed, but there was no response (unsuccessful).
offer received -> offer viewed

Group 3
Customers who purchased/completed the offer without being aware of it: an offer is completed or a transaction occurs before the offer is viewed, or the offer is completed after being viewed but the transaction happened before the view. The offer may have been completed in these cases, but it is not a successful conversion.
a. For the BOGO and discount offer types:
offer received -> transaction -> offer completed -> offer viewed
b. For the informational offer type:
offer received -> transaction -> offer viewed

Group 4
Offer received but not viewed, hence no further action.
offer received

def valid_offer(data):
    """
    Determine whether the offer was completed within the given time duration.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              of an offer for a customer.
    OUTPUT:
        offer_df: the given dataframe with an additional column called
                  valid_offer.
    """
    offer_df = data.copy()
    valid = []
    limit = offer_df['duration'].unique()
    for idx in range(len(offer_df)):
        if offer_df['event'].iloc[idx] == 'offer received':
            start = offer_df.iloc[idx]['time']
            end = start + limit
        valid.append((offer_df.iloc[idx]['time'] < end).astype('int')[0])
    offer_df['valid_offer'] = valid
    return offer_df
def define_offer_success(data, offer_type):
    """
    Determine whether the offer was successfully completed.
    INPUT:
        data: dataframe containing the transactional records and offer profile
              of an offer for a customer.
        offer_type: type of the given offer: informational, bogo, or discount.
    OUTPUT:
        offer_df: the given dataframe with an additional column called
                  offer_success.
    """
    offer_df = data.copy()
    if offer_type == 'informational':
        offer_df['offer_success'] = np.where((offer_df['event'] == 'transaction') &
                                             (offer_df['prev_event_1'] == 'offer viewed') &
                                             (offer_df['prev_event_2'] == 'offer received') &
                                             (offer_df['valid_offer'] == True), 1, 0)
    elif offer_type in ['discount', 'bogo']:
        offer_df['offer_success'] = np.where((offer_df['event'] == 'offer completed') &
                                             (offer_df['prev_event_1'] == 'transaction') &
                                             (offer_df['prev_event_2'] == 'offer viewed') &
                                             (offer_df['prev_event_3'] == 'offer received') &
                                             (offer_df['valid_offer'] == True), 1, 0)
    return offer_df
Some customers received the same offer multiple times, and they did not always complete it each time. Including every completed and uncompleted instance of a customer-offer combination in the final dataframe would give the model contradictory labels. I have therefore chosen to keep only one record per customer-offer pair: if the customer completed the offer even once, the offer is labeled successful; otherwise it is labeled a failure. In addition, I will include a total-offers-received column to capture how many times the offer had to be sent before the customer completed it.
NOTE: By including only the total offers received, I ignore the possibility that a consumer completed the same offer more than once. I considered including a total-offers-completed column as well, but that could result in data leakage.
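The one-record-per-pair rule described above can be sketched with a single groupby: take the max of offer_success (1 if the pair ever succeeded) plus a count of received instances. The `toy` frame and its per-instance labels are hypothetical:

```python
import pandas as pd

# Hypothetical per-instance success labels for customer-offer pairs.
toy = pd.DataFrame({
    'person_id': ['p1', 'p1', 'p1', 'p2'],
    'offer_id': ['abc', 'abc', 'xyz', 'abc'],
    'offer_success': [0, 1, 0, 0],
})

# One row per (person, offer): success if it was ever completed,
# plus how many instances of the offer the person received.
summary = (toy.groupby(['person_id', 'offer_id'])['offer_success']
              .agg(offer_success='max', total_received='count')
              .reset_index())
print(summary.shape)  # (3, 4)
```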
def offer_success():
    """
    Creates a dictionary that describes the effectiveness of each offer
    for a specific customer. An offer is effective if the customer has
    completed it at least once.
    OUTPUT:
        d: a dictionary containing the offer_id, person_id, and offer_success.
    """
    person_ids = transcript_processed['person_id'].unique()
    d = dict()
    d['offer_success'], d['offer_id'], d['person_id'] = [], [], []
    for person_id in tqdm(person_ids):
        person_df = transcript_processed[transcript_processed['person_id'] == person_id]
        # merge with the portfolio columns
        person_df = pd.merge(person_df, portfolio[['offer_id', 'offer_type', 'duration']],
                             on='offer_id', how='left')
        offer_ids = person_df['offer_id'].unique()
        for i in range(len(offer_ids)):
            offer_df = person_df[person_df['offer_id'] == offer_ids[i]]
            offer_df['prev_event_1'] = offer_df['event'].shift()
            offer_df['prev_event_2'] = offer_df['event'].shift(2)
            offer_df['prev_event_3'] = offer_df['event'].shift(3)
            # check whether the offer was completed within the given time.
            offer_df = valid_offer(offer_df)
            # get the offer type of the current offer.
            offer_type = offer_df['offer_type'].unique()[0]
            # check whether the current offer was successful or not.
            offer_df = define_offer_success(offer_df, offer_type)
            # add the offer_success, offer_id, and person_id to the dictionary.
            d['offer_success'].append(offer_df['offer_success'].max())
            d['offer_id'].append(offer_df['offer_id'].unique()[0])
            d['person_id'].append(offer_df['person_id'].unique()[0])
    return d
# d = offer_success()
Save the dictionary obtained from running the function above, since it is time-consuming and inconvenient to rerun it every time the notebook is refreshed. If the notebook is restarted, we can simply load the dictionary from storage.
# with open('../data/cleaned/offer_success.pkl', 'wb') as f:
# pickle.dump(d, f)
with open('../data/cleaned/offer_success.pkl', 'rb') as f:
loaded_dict = pickle.load(f)
df = pd.DataFrame.from_dict(loaded_dict)
df.head()
| | offer_success | offer_id | person_id |
|---|---|---|---|
| 0 | 1 | 5a8bc65990b245e5a138643cd4eb9837 | 0009655768c64bdeb2e877511632db8f |
| 1 | 0 | 3f207df678b143eea3cee63160fa8bed | 0009655768c64bdeb2e877511632db8f |
| 2 | 0 | f19421c1d4aa40978ebb69ca19b0e20d | 0009655768c64bdeb2e877511632db8f |
| 3 | 0 | fafdcd668e3743c1bb461111dcafc2a4 | 0009655768c64bdeb2e877511632db8f |
| 4 | 0 | 2906b810c7d4411798c6938adc9daaa5 | 0009655768c64bdeb2e877511632db8f |
df['offer_success'].value_counts()
0    32549
1    22673
Name: offer_success, dtype: int64
Now that we have the successful offers, we will extract the unsuccessful ones. Unsuccessful offers are those that were viewed but no action was taken, or that were completed after the time limit had passed.
First, I will define a function that returns all offers that were completed after the specified time limit; these offers are deemed failed. After that, we will take the remaining customer-offer pairs and develop a function to retrieve those offers that were viewed but never completed.
ineffective_offers = df[df['offer_success'] == 0]
effective_offers = df[df['offer_success'] == 1]
Approach: I'll combine the events of each customer-offer pair into a single string, then look for the following substrings in the combined event string:

Substring 1: offer received -> offer viewed -> offer received
Substring 2: offer received -> offer viewed (at the end of the string)

Substring 1 matches if the customer received the same offer more than once, whereas substring 2 matches if the customer received the offer only once, or if it was the last offer received and the customer couldn't complete it.
NOTE: The above steps are only performed on the valid offers (completed within the time limit) that were initially deemed ineffective above.
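The two substring rules can be exercised on a couple of hand-made event strings (all of the strings below are hypothetical examples, using the same comma-joined format as the combined event column):

```python
# The two substrings being searched for.
sub1 = 'offer received,offer viewed,offer received'
sub2 = 'offer received,offer viewed'

# Hypothetical combined event strings.
re_received = 'offer received,offer viewed,offer received,offer viewed'
viewed_last = 'offer received,offer viewed'
completed = 'offer received,offer viewed,transaction,offer completed'

# Substring 1: viewed, then the same offer received again with no action between.
hit1 = sub1 in re_received
# Substring 2: the event string ends right after a view.
hit2 = viewed_last.endswith(sub2)
# A completed offer matches neither rule.
miss = (sub1 in completed) or completed.endswith(sub2)
print(hit1, hit2, miss)  # True True False
```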
def invalid_offer():
    """
    Creates a dictionary identifying those customers who completed
    their offers after the given time limit.
    OUTPUT:
        d: a dictionary containing the offer_id, person_id, and offer_invalid.
    """
    person_ids = ineffective_offers['person_id'].unique()
    d = dict()
    d['offer_invalid'], d['offer_id'], d['person_id'] = [], [], []
    for person_id in tqdm(person_ids):
        person_df = ineffective_offers[ineffective_offers['person_id'] == person_id]
        # merge with the portfolio columns
        person_df = pd.merge(person_df, portfolio[['offer_id', 'offer_type', 'duration']],
                             on='offer_id', how='left')
        person_df = pd.merge(person_df, transcript_processed[['event', 'time', 'person_id', 'offer_id']],
                             on=['person_id', 'offer_id'], how='inner')
        offer_ids = person_df['offer_id'].unique()
        for i in range(len(offer_ids)):
            offer_df = person_df[person_df['offer_id'] == offer_ids[i]]
            # check whether the offer was completed within the given time.
            offer_df = valid_offer(offer_df)
            offer_df['offer_invalid'] = np.where(offer_df['valid_offer'] == False, 1, 0)
            # add the offer_invalid, offer_id, and person_id to the dictionary.
            d['offer_invalid'].append(offer_df['offer_invalid'].max())
            d['offer_id'].append(offer_df['offer_id'].unique()[0])
            d['person_id'].append(offer_df['person_id'].unique()[0])
    return d
def get_group2_offers(string, informational=False):
    """
    Get all the offers which were viewed by the customer, but on which
    no further action was taken.
    INPUT:
        string: combined events for a customer-offer pair.
        informational: True if the offer_type is informational, else False.
    OUTPUT:
        int: 1 if the customer viewed the offer but did not take any
             action, otherwise 0.
    """
    substring_1 = 'offer received,offer viewed,offer received'
    substring_2 = 'offer received,offer viewed'
    if substring_1 in string or string.endswith(substring_2):
        return 1
    elif informational == False:
        # invalid offer -- not completed within the given time
        if 'transaction' in string and 'offer completed' not in string:
            return 1
        else:
            return 0
    else:
        return 0
def combine_event(val_df):
    '''
    Combine all the events of a customer-offer pair into a string.
    INPUT:
        val_df: dataframe containing the offers which were completed in
                due time.
    '''
    val_df = (val_df[['person_id', 'offer_id', 'event']]
              .groupby(['person_id', 'offer_id'])['event']
              .apply(','.join).reset_index())
    return val_df
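combine_event above is a groupby-join: one comma-separated event string per customer-offer pair. A toy run of the same groupby on hypothetical data:

```python
import pandas as pd

# Hypothetical events for a single customer-offer pair.
toy = pd.DataFrame({
    'person_id': ['p1', 'p1', 'p1'],
    'offer_id': ['abc', 'abc', 'abc'],
    'event': ['offer received', 'offer viewed', 'transaction'],
})

# Same groupby-join as combine_event: one string per (person, offer) pair.
combined = (toy.groupby(['person_id', 'offer_id'])['event']
               .apply(','.join).reset_index())
print(combined['event'].iloc[0])  # offer received,offer viewed,transaction
```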
Save the following dictionary in a pickle file so we don't have to run the above function every time we load the notebook.
# d = invalid_offer()
# with open('../data/cleaned/invalid_offer.pkl', 'wb') as f:
# pickle.dump(d, f)
# load the dictionary containing the valid and invalid offers
with open('../data/cleaned/invalid_offer.pkl', 'rb') as f:
loaded_dict = pickle.load(f)
ineffective_df = pd.DataFrame.from_dict(loaded_dict)
## get the valid offers to check for those customers who did not complete the offer after viewing it.
valid_offers = ineffective_df[ineffective_df['offer_invalid']==0]
## merge the transactional records for each of the customer
valid_offers = pd.merge(transcript_processed, valid_offers.iloc[:,1:],
on=['offer_id', 'person_id'], how='inner')
## add offer type, given the offer ID
offertype = portfolio[['offer_id', 'offer_type']]
valid_offers = pd.merge(valid_offers, offertype, on='offer_id', how='inner')
## split the data: one part containing the informational offer type, the other containing the remaining offer types.
# also combine the events into a single string.
non_informational = combine_event(valid_offers[valid_offers['offer_type'] != 'informational'])
informational = combine_event(valid_offers[valid_offers['offer_type'] == 'informational'])
# Finally getting the customers who didn't complete their offer.
non_informational['unsuccessful'] = non_informational['event'].apply(lambda x: get_group2_offers(x))
informational['unsuccessful'] = informational['event'].apply(lambda x: \
get_group2_offers(x, informational=True))
# concatenate the different offer types and the invalid offers together.
# introduce a variable 'offer_success' and set it to 0 before concatenating
# this table with the successful offers.
invalid_offers = ineffective_df[ineffective_df['offer_invalid']==1]
unsuccessful_df = pd.concat([non_informational[non_informational['unsuccessful']==1][['person_id', 'offer_id']],
informational[informational['unsuccessful']==1][['person_id', 'offer_id']],
invalid_offers[['person_id', 'offer_id']]],axis=0).iloc[:,:2].reset_index(drop=True)
unsuccessful_df['offer_success'] = 0
df = pd.concat([effective_offers, unsuccessful_df]).sample(frac=1).reset_index(drop=True)
df.head()
| | offer_success | offer_id | person_id |
|---|---|---|---|
| 0 | 1 | f19421c1d4aa40978ebb69ca19b0e20d | 18f6d43ce45c41b398606753c3412f60 |
| 1 | 0 | 3f207df678b143eea3cee63160fa8bed | dce9bfaf37e64b75b5c40866cfe71c80 |
| 2 | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 5113f5c8e57b4a0d9b42d4919562abd0 |
| 3 | 1 | f19421c1d4aa40978ebb69ca19b0e20d | ac18d9e07ed24f9ebd269082390bedb5 |
| 4 | 1 | 2906b810c7d4411798c6938adc9daaa5 | 055eab487b874f38b65f3edf31b5eb41 |
Now we get all the customers who did not view the offer. There are two groups of customers in this category: group 3, who completed the offer without viewing it, and group 4, who took no action after receiving it.
These customer-offer pairs will not be used to train a machine learning model, but we will use them to analyse the customer profiles and save the costs associated with sending them the offers. Sending these offers is pointless: group 3 customers will complete the offer regardless of whether they receive it, and group 4 customers will not make the transaction because they never view the offer. Group 4 customers could instead be targeted through other mediums where they are more likely to view the offer.
In the following function, we identify groups 3 and 4 among the customer-offer pairs from the dataframe above that did not fall into group 2. Group 4 customers are easy to identify: they have no events other than 'offer received'. Group 3 customers, on the other hand, can be found by searching for an 'offer completed' event; since successful offers have already been extracted, the remaining customers with a completion most likely belong to group 3. For informational offers we look for a 'transaction' event instead.
def grp3_4(string, informational=False):
    """
    Classify a customer-offer pair where the offer was never viewed.
    INPUT:
        string: combined events for a customer-offer pair
        informational: True if the offer_type is informational, else
                       False.
    OUTPUT:
        integer: returns 3 if the customer completed the offer
                 (or, for informational offers, made a transaction)
                 without viewing it, 4 if the customer took no
                 action after receiving it.
    """
    # for informational offers, a transaction counts as completion.
    substring = 'transaction' if informational else 'offer completed'
    # customers who completed the offer w/o viewing it.
    if substring in string:
        return 3
    # customers who took no action after receiving the offer.
    if 'offer viewed' not in string and 'transaction' not in string:
        return 4
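As a quick sanity check, the classification logic can be exercised on a few hand-written event strings. The sketch below compactly restates the function so it is self-contained; the event strings are illustrative examples, not records from the transcript:

```python
def classify_group(events, informational=False):
    """Compact restatement of grp3_4 for a quick sanity check."""
    # for informational offers, a transaction counts as completion
    marker = 'transaction' if informational else 'offer completed'
    if marker in events:
        return 3  # completed/transacted without ever viewing
    if 'offer viewed' not in events and 'transaction' not in events:
        return 4  # no action at all after 'offer received'
    return None   # viewed or transacted without completing

# Hypothetical event strings (not drawn from the transcript):
assert classify_group('offer received,transaction,offer completed') == 3
assert classify_group('offer received') == 4
assert classify_group('offer received', informational=True) == 4
assert classify_group('offer received,transaction', informational=True) == 3
```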
# getting all the customer-offer pairs that did not view the offer, for both
# informational and non-informational offer types.
grp3_4_I = informational[informational['unsuccessful'] == 0]
grp3_4_BD = non_informational[non_informational['unsuccessful'] == 0]
# Differentiate b/w customers who completed the offer w/o viewing it (3)
# and those who didn't view the offer and took no further action (4).
grp3_4_BD['group'] = grp3_4_BD['event'].apply(lambda x: grp3_4(x))
grp3_4_I['group'] = grp3_4_I['event'].apply(lambda x: grp3_4(x, informational=True))
# concatenate both offer types
grp3_4_df = pd.concat([grp3_4_BD, grp3_4_I]).drop('unsuccessful',axis=1)
grp3_4_df.head()
| person_id | offer_id | event | group | |
|---|---|---|---|---|
| 0 | 0009655768c64bdeb2e877511632db8f | 2906b810c7d4411798c6938adc9daaa5 | offer received,transaction,offer completed,tra... | 3 |
| 1 | 0009655768c64bdeb2e877511632db8f | f19421c1d4aa40978ebb69ca19b0e20d | offer received,transaction,offer completed,off... | 3 |
| 2 | 0009655768c64bdeb2e877511632db8f | fafdcd668e3743c1bb461111dcafc2a4 | offer received,transaction,offer completed,off... | 3 |
| 3 | 0020c2b971eb4e9188eac86d93036a77 | ae264e3637204a6fb9bb56bc8210ddfd | offer received | 4 |
| 4 | 003d66b6608740288d6cc97a6903f4f0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | offer received,transaction,offer completed | 3 |
Exploring Group 3 Customers
grp3_df = grp3_4_df[grp3_4_df['group']==3][['person_id', 'offer_id']]
grp3_df = pd.merge(grp3_df, portfolio, on='offer_id', how='inner')
grp3_df = pd.merge(grp3_df, profile, on='person_id', how='inner')
grp3_df.head()
| person_id | offer_id | offer_num | offer_type | duration | difficulty | reward | mobile | web | social | gender | age | income | membership_month | membership_year | membership_day | member_since | age_group | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0009655768c64bdeb2e877511632db8f | 2906b810c7d4411798c6938adc9daaa5 | 9 | discount | 7 | 10 | 2 | 1 | 1 | 1 | 0 | M | 33 | 72000.0 | 4 | 2017 | 21 | 1913 | (30, 40] |
| 1 | 0009655768c64bdeb2e877511632db8f | f19421c1d4aa40978ebb69ca19b0e20d | 8 | bogo | 5 | 5 | 5 | 1 | 1 | 1 | 1 | M | 33 | 72000.0 | 4 | 2017 | 21 | 1913 | (30, 40] |
| 2 | 0009655768c64bdeb2e877511632db8f | fafdcd668e3743c1bb461111dcafc2a4 | 6 | discount | 10 | 10 | 2 | 1 | 1 | 1 | 1 | M | 33 | 72000.0 | 4 | 2017 | 21 | 1913 | (30, 40] |
| 3 | 00715b6e55c3431cb56ff7307eb19675 | 2906b810c7d4411798c6938adc9daaa5 | 9 | discount | 7 | 10 | 2 | 1 | 1 | 1 | 0 | F | 58 | 119000.0 | 12 | 2017 | 7 | 1683 | (50, 60] |
| 4 | 00715b6e55c3431cb56ff7307eb19675 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 4 | discount | 10 | 20 | 5 | 0 | 1 | 1 | 0 | F | 58 | 119000.0 | 12 | 2017 | 7 | 1683 | (50, 60] |
plot_hist(grp3_df, ['income', 'age'])
grp3_df.describe()
| offer_num | duration | difficulty | reward | mobile | web | social | age | income | membership_month | membership_year | membership_day | member_since | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.0 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 | 10533.000000 |
| mean | 4.394854 | 6.780404 | 8.220735 | 3.700275 | 0.825881 | 1.0 | 0.854078 | 0.365043 | 54.892528 | 67157.599924 | 6.740720 | 2016.451628 | 15.945410 | 2034.596601 |
| std | 2.746624 | 2.233669 | 6.549410 | 3.016488 | 0.379230 | 0.0 | 0.353045 | 0.481465 | 17.728623 | 22771.554907 | 3.520394 | 1.175964 | 8.709264 | 411.946622 |
| min | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 18.000000 | 30000.000000 | 1.000000 | 2013.000000 | 1.000000 | 1452.000000 |
| 25% | 2.000000 | 5.000000 | 5.000000 | 2.000000 | 1.000000 | 1.0 | 1.000000 | 0.000000 | 43.000000 | 49000.000000 | 4.000000 | 2016.000000 | 8.000000 | 1706.000000 |
| 50% | 4.000000 | 7.000000 | 7.000000 | 5.000000 | 1.000000 | 1.0 | 1.000000 | 0.000000 | 56.000000 | 65000.000000 | 7.000000 | 2017.000000 | 16.000000 | 1949.000000 |
| 75% | 7.000000 | 7.000000 | 10.000000 | 5.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 67.000000 | 83000.000000 | 10.000000 | 2017.000000 | 24.000000 | 2304.000000 |
| max | 9.000000 | 10.000000 | 20.000000 | 10.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 101.000000 | 120000.000000 | 12.000000 | 2018.000000 | 31.000000 | 3275.000000 |
grp3_df[['email', 'social', 'web', 'mobile']].sum()
email     10533
social     3845
web        8996
mobile     8699
dtype: int64
plot_bar_sub(grp3_df, ['offer_num', 'offer_type', 'membership_year', 'gender'])
Exploring Group 4 customers
grp4_df = grp3_4_df[grp3_4_df['group']==4][['person_id', 'offer_id']]
grp4_df = pd.merge(grp4_df, portfolio, on='offer_id', how='inner')
grp4_df = pd.merge(grp4_df, profile, on='person_id', how='inner')
grp4_df.head()
| person_id | offer_id | offer_num | offer_type | duration | difficulty | reward | mobile | web | social | gender | age | income | membership_month | membership_year | membership_day | member_since | age_group | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0020c2b971eb4e9188eac86d93036a77 | ae264e3637204a6fb9bb56bc8210ddfd | 0 | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 | F | 59 | 90000.0 | 3 | 2016 | 4 | 2326 | (50, 60] |
| 1 | 00e52682848542c3a6f59b7824e9a5c5 | ae264e3637204a6fb9bb56bc8210ddfd | 0 | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 | M | 40 | 65000.0 | 10 | 2016 | 18 | 2098 | (30, 40] |
| 2 | 00ed7e22b32749cfafbfd88592d401d4 | ae264e3637204a6fb9bb56bc8210ddfd | 0 | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 | M | 55 | 94000.0 | 7 | 2016 | 31 | 2177 | (50, 60] |
| 3 | 01925607d99c460996c281f17cdbb9e2 | ae264e3637204a6fb9bb56bc8210ddfd | 0 | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 | F | 57 | 116000.0 | 11 | 2015 | 19 | 2432 | (50, 60] |
| 4 | 01e8739670a042b3877f8e843bdf55a7 | ae264e3637204a6fb9bb56bc8210ddfd | 0 | bogo | 7 | 10 | 10 | 1 | 1 | 0 | 1 | F | 54 | 119000.0 | 10 | 2015 | 24 | 2458 | (50, 60] |
# have a look at the mediums through which they were targeted
grp4_df[['email', 'social', 'web', 'mobile']].sum()
email     3664
social     644
web       3248
mobile    2787
dtype: int64
plot_bar_sub(grp4_df, ['email', 'social', 'web', 'mobile'])
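The channel sums above hint at why group 4 customers never saw their offers: the share of offers delivered over social and mobile is noticeably lower than for group 3. A minimal sketch, hard-coding the counts printed above (and assuming, as the portfolio suggests, that every offer includes the email channel, so the email count equals the number of offers sent):

```python
# Channel delivery counts copied from the grp3_df / grp4_df sums above.
grp3_channels = {'email': 10533, 'social': 3845, 'web': 8996, 'mobile': 8699}
grp4_channels = {'email': 3664, 'social': 644, 'web': 3248, 'mobile': 2787}

# Share of each group's offers that went out on each channel;
# email counts every offer, since all offers include the email channel.
for channel in ('social', 'mobile', 'web'):
    share3 = grp3_channels[channel] / grp3_channels['email']
    share4 = grp4_channels[channel] / grp4_channels['email']
    print(f"{channel:<7} group 3: {share3:.2f}   group 4: {share4:.2f}")
# social  group 3: 0.37   group 4: 0.18
```

Under this reading, group 4 pairs were sent over social (and mobile) far less often, which supports re-targeting these customers on those channels.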
The percent success of an offer can be calculated by counting, in the dataframe above, the number of times the offer was successfully completed and dividing by the total number of times it was issued to customers.
During EDA, we calculated the percent success of an offer as the ratio of 'offer completed' events to 'offer received' events. That approach is flawed, however, because many offers were completed before the customer even viewed them; the earlier approach would have counted these as successful, which they are not.
Even so, the results below look very similar to what we obtained during the EDA stage.
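To make the distinction concrete, here is a toy example (hypothetical data, not drawn from the transcript) contrasting the naive EDA-style ratio with the view-aware one:

```python
import pandas as pd

# Toy customer-offer pairs for a single offer: 4 received, 2 completed
# after viewing (true successes), 1 completed without viewing (group 3),
# 1 never completed.
toy = pd.DataFrame({
    'person_id': ['a', 'b', 'c', 'd'],
    'completed': [1, 1, 1, 0],   # any 'offer completed' event
    'viewed':    [1, 1, 0, 0],   # did the customer view the offer?
})
# a success requires both a view and a completion
toy['offer_success'] = toy['completed'] & toy['viewed']

naive     = toy['completed'].sum() / len(toy)      # 0.75 - overstates success
corrected = toy['offer_success'].sum() / len(toy)  # 0.50 - view required
print(naive, corrected)
```

The customer who completed without viewing inflates the naive ratio from 0.50 to 0.75, which is exactly the distortion the grouping above removes.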
percent_success = df.groupby('offer_id').sum()['offer_success'] / df.groupby('offer_id').count()['person_id']
percent_success = percent_success.rename('percent_success').reset_index()
portfolio_percentsuccess = pd.merge(percent_success, portfolio, on='offer_id', how='inner')
plot_bar_chart(portfolio_percentsuccess['offer_num'], y=portfolio_percentsuccess['percent_success'],
title='Percentage of offer success')
# merge the engineered feature with the dataframe.
df = pd.merge(df, percent_success, on='offer_id', how='inner')
Calculate how many times a customer was sent the same offer. This can be done by slicing the 'offer received' events from the event column of the transcript dataset, then grouping the result by person_id and offer_id to count how many times each customer received each offer.
Lastly, plot the results on a bar chart.
tot_offers = pd.get_dummies(transcript_processed,columns=['event'])
tot_offers = tot_offers[tot_offers['event_offer received'] == 1]
tot_offers = tot_offers.groupby(['person_id', 'offer_id']).count()['event_offer received'].reset_index()
tot_offers = tot_offers.rename(columns={'event_offer received': 'total_similar_offers_received'})
tot_offers.head(3)
| person_id | offer_id | total_similar_offers_received | |
|---|---|---|---|
| 0 | 0009655768c64bdeb2e877511632db8f | 2906b810c7d4411798c6938adc9daaa5 | 1 |
| 1 | 0009655768c64bdeb2e877511632db8f | 3f207df678b143eea3cee63160fa8bed | 1 |
| 2 | 0009655768c64bdeb2e877511632db8f | 5a8bc65990b245e5a138643cd4eb9837 | 1 |
tot_offers_count = tot_offers['total_similar_offers_received'].value_counts()
x = tot_offers_count.index
y = tot_offers_count
plot_bar_chart(x, y)
Finally, I'll determine the total number of offers each customer received. This could be a useful feature for our machine learning model, since it makes sense to send repeated offers to a customer who has already completed previous ones.
totoffer_by_person = tot_offers.groupby('person_id').sum().reset_index()
totoffer_by_person.columns = ['person_id', 'total_offers_received']
totoffer_by_person_count = totoffer_by_person['total_offers_received'].value_counts()
x = totoffer_by_person_count.index
y = totoffer_by_person_count
plot_bar_chart(x,y)
totoffer_by_person.head(3)
| person_id | total_offers_received | |
|---|---|---|
| 0 | 0009655768c64bdeb2e877511632db8f | 5 |
| 1 | 0011e0d4e6b944f998e987f904e8c1e5 | 5 |
| 2 | 0020c2b971eb4e9188eac86d93036a77 | 5 |
Merge with dataframe created above to include the engineered features.
df = pd.merge(df, tot_offers, on=['offer_id','person_id'], how='inner')
df = pd.merge(df, totoffer_by_person, on='person_id', how='inner')
df.head(3)
| offer_success | offer_id | person_id | percent_success | total_similar_offers_received | total_offers_received | |
|---|---|---|---|---|---|---|
| 0 | 1 | f19421c1d4aa40978ebb69ca19b0e20d | 18f6d43ce45c41b398606753c3412f60 | 0.587763 | 1 | 6 |
| 1 | 0 | 3f207df678b143eea3cee63160fa8bed | 18f6d43ce45c41b398606753c3412f60 | 0.464761 | 1 | 6 |
| 2 | 1 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 18f6d43ce45c41b398606753c3412f60 | 0.459239 | 1 | 6 |
## save the required csv files for model implementation
df.to_csv('../data/cleaned/feat_engineered.csv', index=False)
profile.drop('age_group', axis=1).to_csv('../data/cleaned/profile.csv', index=False)
portfolio.to_csv('../data/cleaned/portfolio.csv', index=False)
transcript_processed.to_csv('../data/cleaned/transcript_processed.csv', index=False)
Conclusions:
- The average age of a Starbucks customer is almost 55 years.
- Offer number 6 has the highest overall success percentage; if Starbucks wants to focus on a single offer, it is unquestionably offer 6.
- For customers under 30 the best offer is number 5; above 30 it is number 6; for elderly customers, offers 1 and 8 perform best.
- Offer 6 performs best for both male and female customers.
- Most customers received the same offer only once.